IMS Collections 

Probability and Statistics: Essays in Honor of David A. Freedman 

Vol. 2 (2008) 302-315 

@ Institute of Mathematical Statistics, 2008 
DOI: 10.1214/193940307000000509 



Testing earthquake predictions 

Brad Luen^ and Philip B. Stark^ 

University of California, Berkeley 

Abstract: Statistical tests of earthquake predictions require a null hypothesis 
to model occasional chance successes. To define and quantify 'chance success' 
is knotty. Some null hypotheses ascribe chance to the Earth: Seismicity is mod- 
eled as random. The null distribution of the number of successful predictions - 
or any other test statistic — is taken to be its distribution when the fixed set of 
predictions is applied to random seismicity. Such tests tacitly assume that the 
predictions do not depend on the observed seismicity. Conditioning on the pre- 
dictions in this way sets a low hurdle for statistical significance. Consider this 
scheme: When an earthquake of magnitude 5.5 or greater occurs anywhere 
in the world, predict that an earthquake at least as large will occur within 
21 days and within an epicentral distance of 50 km. We apply this rule to the 
Harvard centroid-moment-tensor (CMT) catalog for 2000-2004 to generate a 
set of predictions. The null hypothesis is that earthquake times are exchange- 
able conditional on their magnitudes and locations and on the predictions— a 
common "nonparametric" assumption in the literature. We generate random 
seismicity by permuting the times of events in the CMT catalog. We consider 
an event successfully predicted only if (i) it is predicted and (ii) there is no 
larger event within 50 km in the previous 21 days. The P-value for the observed 
success rate is < 0.001: The method successfully predicts about 5% of earth- 
quakes, far better than 'chance,' because the predictor exploits the clustering 
of earthquakes — occasional foreshocks - which the null hypothesis lacks. Rather 
than condition on the predictions and use a stochastic model for seismicity, 
it is preferable to treat the observed seismicity as fixed, and to compare the 
success rate of the predictions to the success rate of simple-minded predictions 
like those just described. If the proffered predictions do no better than a simple 
scheme, they have little value. 



1. Introduction 

Earthquake prediction has roots in antiquity [4i)]. Predictions have been based on 
a variety of seismic and non-seismic phenomena, including animal behavior [7, IS, 
2-5, 2(), 40]; water level, temperature and composition in wells and springs [2, .'SC]; 
electric and magnetic fields and radio waves on the ground and in the air [.■), ")()]; 
electrical resistivity of the ground and the air [.3, 30, 53]; cloud formations or other 
atmospheric phenomena [41, 46]; infrared radiation [3]; pattern recognition [4, 24]; 
temporal clustering [11, 15]; and variations in the rate or pattern of seismicity [10]. 

There are large research efforts directed towards predicting earthquakes, such 
as the Collaboratory for the Study of Earthquake Predictability.^ "Even a stopped 
clock is right twice a day," and almost any method for predicting earthquakes will 
succeed occasionally — whether the method has merit or not. Indeed, prominent 
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geophysicists disagree about whether earthquake prediction is possible in principle.'^ 
How, then, ought we decide whether a method for predicting earthquakes works? 

Earthquake predictions have been assessed using ideas from statistical hypothesis 
testing: A test statistic is compared to its distribution under a null hypothesis [10, 
11, 19, 22, 28, 40, 55, 57]. The null hypothesis is rejected at significance level a if the 
test statistic exceeds the 1 — a quantile of its distribution under the null hypothesis. 
If the null hypothesis is rejected, researchers tend to conclude — erroneously — that 
the predictions must have merit. 

The null hypothesis can be rejected for many reasons. A Type I error might 
occur. Or the null hypothesis could be false, but in a way that docs not imply that 
the predictions work. For example, the null hypothesis might use a poor model for 
seismicity. Or the null hypothesis might not account for how the predictions depend 
on seismicity. We explore that possibility below. 

Conclusions ultimately depend on details of the null hypothesis that can be 
difficult to justify, or that are known to contradict the data. For example, Freedman 
and Stark"* [l'.'>] argue that standard interpretations of probability do not make 
sense for earthquakes - especially for large events, the most important to predict. 
For rare events, such as large earthquakes, there are not enough data to test or 
discriminate among competing stochastic models. Models are often calibrated using 
empirical scaling laws that tie the rates of occurence of large earthquakes to the 
rates for smaller earthquakes. Generally, these rules of thumb are themselves fitted 
to data from other parts of the world: applying them to a region as small as the 
San Francisco Bay area, for example, is questionable. Thus, stochastic models for 
earthquake occurrence do not seem like a good foundation for evaluating earthquake 
predictions, especially predictions of large earthquakes. 

Moreover, Stark [42, 43] argues that testing predictions using a stochastic model 
for seismicity and conditioning on the predictions tends to be misleading and that it 
is preferable to treat seismicity as fixed and compare the success of the predictions 
with the success of a simple rule. Consider rain forecasts as an analogy. The rule 
"if it rains today, predict that it will rain tomorrow; otherwise, predict that it will 
not rain tomorrow" works pretty well. If a meteorologist cannot do better, the 
meteorologist's predictions have little value. 

The seismic analogue is "if there is an earthquake with magnitude greater than 
the threshold Mr, predict that there will be an earthquake with magnitude M or 
above within time t and distance d of the first." Here, M , t and d might depend 
on the location or magnitude of the first earthquake. Kagan [20] uses this "au- 
tomatic alarm" strategy to evaluate earthquake predictions for Greece (the VAN 
predictions). The approach can also include a stochastic element to make a "semi- 
automatic alarm" strategy: Stark [42, 43] compares the VAN predictions to the rule: 
"If there is an earthquake with magnitude > Mr, toss a (biased) coin. If the coin 
lands heads, predict that there will be an earthquake with magnitude > M within 
time t and distance d of the first. If the coin lands tails, do not make a prediction." 

2. Phenomenology of earthquakes 

Sec Bolt [5] for a lay review. The epicenter of an earthquake is the point on Earth's 
surface directly above the earthquake's focus, the place that the motion nucleates. 

^See, e.g., http://www.nature.com/nature/debates/earthquake/equake_fraineset.html and 
[11], 

*A preprint is available at http://statistics.berkeley.edu/tech-reports/611.pdf. 
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Epicenters and foci are not known exactly: They are estimated from ground motion 
at seismographic observing stations around the globe. Sizes of earthquakes are also 
estimated from ground motion measured at seismographic stations. There are many 
measures of earthquake size, including several definitions of "magnitude." 

An earthquake catalog is a list of the estimated foci, times, and magnitudes of 
earthquakes found by a given authority, such as the U.S. Geological Survey. Earth- 
quake catalogs are incomplete below some magnitude (left-censored in magnitude) 
because smaller events are harder to identify and locate. Moreover, unless some 
minimum number of stations detect ground motion, the algorithms used to locate 
earthquakes do not even conclude that there was an earthquake. (The incomplete- 
less as a function of magnitude tends to decrease with time, as equipment becomes 
more sensitive and networks more extensive.) 

Earthquakes occur to depths of 700 km or so in Earth's mantle [45]; however, 
most earthquakes and almost all large earthquakes occur within a few tens of kilo- 
meters of the surface. Earthquakes cluster in space. Most earthquakes occur on pre- 
existing faults. With very few known exceptions, epicenters of large earthquakes are 
close to the margins of tectonic plates, because it takes large strains — the relative 
motions of plates — to produce large earthquakes. Indeed, most large earthquakes 
occur in a relatively narrow band around the Pacific Ocean, the "ring of fire." 

Earthquakes also cluster in time: Large earthquakes almost invariably have after- 
shocks; some have foreshocks; and there are "swarms" of moderatc-to-large earth- 
quakes. Defining foreshocks and aftershocks is difficult. The terms "foreshock" and 
"aftershock" imply a causal connection to a main shock. Unfortunately, earthquake 
physics is largely a mystery. Proximity in space and time can be coincidental rather 
than causal. One cannot tell whether an earthquake is the main shock or a foreshock 
of a larger event except — at best — in retrospect.'^ And stochastic models for earth- 
quakes can produce spatiotemporal clustering without foreshocks or aftershocks per 
se (for example. Gamma renewal models [4(S]). 

The most common stochastic model for seismicity takes the epicenters and times 
of shocks'' above some threshold magnitude to be a realization of a spatially inho- 
mogcneous but temporally homogeneous Poisson process. The spatial heterogeneity 
reflects tectonics: Some regions are more active seismically than others. The tem- 
poral homogeneity is justified by appeal to the lengthy time scale of plate tectonics 
(tens of thousands of years) relative to the time scale of observation, which is on 
the order of centuries. The seminal reference on stochastic models for seismicity is 
Vere- Jones [52], which considers temporal and marked temporal processes, but not 
spatial processes. Some recent models use branching processes [33, 34, 35]. 

3. Tests of earthquake predictions 

There are two categories of earthquake predictions: deterministic or binary predic- 
tions, which are of the form "there will be an earthquake of magnitude > 6 within 
100 km of San Francisco, CA, within the next 30 days;" and probabilistic predic- 
tions, which arc probability distributions, or statements of the form "there is a 90% 
chance of an earthquake of magnitude > 6 within 100 km of San Francisco, CA, in 

^Identifying an event as a foreshock or aftershock is a causal inference based on association in 
time and space. Causal conclusions from associations in non-experimental data are highly suspect. 
See, e.g., Freedman [12]. 

^Sometimes this is restricted to main shocks, which are difficult to distinguish from foreshocks 
and aftershocks, as noted above. 



Testing earthquake predictions 



305 



the next 30 days." Freedman and Stark [l'>] point out some difficulties in interpret- 
ing probabilistic predictions, using the USGS prediction for the San Francisco Bay 
Area as an example; here we concentrate on deterministic predictions. 

To keep the exposition simple, we take the goal to be to predict all earthquakes 
that exceed some threshold magnitude M , that have epicenters in some region R 
of Earth's surface, and that occur during some time period T. We examine several 
statistical approaches to testing whether predictions have merit. ' 

Let Q denote the total number of earthquakes of magnitude > M with epicenters 
in R during T. Let A denote the number of alarms (predictions). The jth alarm is 
characterized by Vj, a connected region of space, time and magnitude, and a value 
Pj, the probability the alarm assigns to the occurrence of an earthquake in Vj. 
Deterministic predictions take pj = 1. They assert that an earthquake will occur in 
Vj. Probabilistic forecasts assign a probability pj S (0, 1) to the occurrence of an 
earthquake in Vj . Let 6j be the duration of Vj . When 5j is on the order of weeks, Vj 
is generally considered a "prediction." When Sj is on the order of a year or more, 
Vj is generally considered a "forecast." However, some authors use "prediction" 
to mean deterministic prediction and "forecast" to mean probabilistic prediction, 
regardless of the time horizon. 

Let Xj denote the historical rate of earthquakes in the spatial and magnitude 
range — but not the temporal range — covered by Vj.^ The historical rates {Xj}f^i 
enter into some tests, as we shall see. Let Sj indicate whether the jth alarm is 
successful: 



(1) 



3 



1, if there is an earthquake in Vj, 
0, otherwise; 



and for fc = 1, . . . , Q let indicate whether the fcth earthquake is predicted: 

(2) Pk = 



1, if the fcth event is in some Vj, j ^ 1, . . . , A, 
0, otherwise. 



Let S = J2f=i denote the number of successful alarms, and let P = X^sLi 



k 



denote the number earthquakes that are predicted successfully. The number of 
false alarms is F = A — S and the number of earthquakes that are missed — not 
predicted — is M = Q — P . Let V be the volume of space-time studied. 



(3) V= drdt, 

Jr,t 

and let Va denote the total space-time volume of alarms. 



(4) Va = / drdt. 

The fraction of the study volume covered by alarms is u = Va/V. Generally, the 
smaller v is, the more informative the alarms are, but this can be distorted by 



''Statistical terminology is used in some unfamiliar ways in the geophysical literature. For 
example, "significance" and "confidence" sometimes denote 100% minus the P-value, rather than 
the chance of a type I error for a fixed-size test (e.g., [28], p. 193 and [0], pp. 723, 731, which also 
confuses the P-value with the chance that the null hypothesis is true). "Random probabilities" 
are sometimes fixed parameters [1!)], p. 3773, and "parameters" sometimes means statistics [8], 
p. 263. 

* Typically, Xj is the empirical rate over a span of a decade or more over a spatial region that 
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spatial heterogeneity of the distribution of earthquakes in R.^ The success rate of 
the predictions is s = S/A] the fraction of earthquakes successfully predicted is 
p = P/Q; the false alarm rate is / = F/A; and the rate of missed events (failures 
to predict) is m = M/Q. It we raise an alarm for the entire study volume and time 
V, we can ensure that s = p = 1, but then f = 1, so the alarms are not informative. 

Predictions are generally evaluated using a combination of s, p, /, m and v. 
Prediction methods can be ranked by adjusting their tuning parameters so that 
their values of v arc equal, then comparing their values of p, or vice versa. For a 
given alarm volume v, the method with largest p is best. For a given value of p, the 
method with the smallest v is best. Some evaluation strategics fix p and compare 
values of /, or vice versa. 

3.1. Testing strategies 

A common strategy for evaluating earthquake predictions statistically is to compare 
the success of the predictions on the observed seismicity with the success of the same 
predictions on random seismicity (e.g., [22, 31, 39, 51]). This strategy does not make 
sense because predictions usually depend on past seismicity: If the seismicity had 
been different, the predictions would have been different.^*' 

Several stochastic models for seismicity are common in null hypotheses for testing 
predictions. 

1. Some studies model seismicity by a homogeneous Poisson process with inten- 
sity equal to the mean historical rate in the study region (e.g., [I'J]). Some 
studies condition on the number of events and model seismicity as uniform 
over the study region or subsets of the study region [G, 32, 55]. 

2. Some studies use a spatially heterogeneous but temporally homogeneous Pois- 
son process model, with rate in the spatial region Rj equal to the historical 
rate [11, 28]. 

3. Some studies condition on the observed locations of past events, but model 
the times of the events as Poisson or uniform [28, 47]. 

4. Some studies condition on the observed locations and the observed times, but 
model the times as exchangeable [20, 37]. That is, if the observed time of the 
jth event in the catalog is tj, j = 1, . . . ,Q, then, according to the model, it 
is equally likely that the times would have been t^(^j-^, j = 1, . . . , Q, where tt 
is any permutation of {1, ... , Q}. 

In the last approach (the permutation model, sometimes called "randomizing a 
catalog"), times of events in the study region are exchangeable. 

There are variations on these approaches. For example, some researchers try 
to remove putative aftershocks from the catalogs (e.g., [1, 20, 40]). This is called 

^To account for the spatial heterogeneity of events, some authors use normalized counting 
measure in space — based on the historical occurrence of events in a given volume — rather than 
Lebesgue measure. See, e.g., Kossobokov et al. [28]. 

^"This is a bit like the Monte Hall or Let's Make a Deal problem [11], Chapter 10: A prize is 
hidden at random behind one of three doors. The contestant picks a door. The host then reveals 
that the prize is not behind one of the two doors the contestant did not pick. The contestant is 
now allowed to switch his guess to the third door. Should he? Some erroneous arguments assume 
that the door the host opens is independent of which door conceals the prize. That is not a good 
model for the game because the host never opens the door that hides the prize: Which door the 
host opens depends on the contestant's guess and on which door hides the prize. Similarly, holding 
the prediction fixed regardless of the seismicity is not a good model for earthquake prediction. 
Whether a prediction is issued for tomorrow typically depends on whether there is an earthquake 
today. 
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"declustering." The most common method for declustering is to make spatiotempo- 
ral holes in a catalog: After each event, all smaller events that occur within a given 
time interval and epicentral distance are deleted. The time interval and distance 
can depend on the magnitude of the event [23, 27, 38]. (For an alternative approach 
to declustering, see It is common to assume that a dcclustcred catalog is a 

realization of a temporally homogeneous Poisson process. Assessments of earth- 
quake predictions are known to be sensitive to details of declustering and to spatial 
variability of the rate of seismicity [17, 21, 29, 31, 42, 43, 49, -54]. 

Another approach to testing is to compare the success rate of predictions with 
the (theoretical or empirical) success rate of random predictions that do not use 
any seismic information [Ki]. This seems to be a straw-man comparison because 
such random predictions ignore the empirical clustering of seismicity. 

3.2. Jackson, 1996 

Jackson [19] reviews methods for testing deterministic and probabilistic predic- 
tions. The approach to testing deterministic predictions is based on a probability 
distribution for the number of successful predictions, in turn derived from a null 
hypothesis that specifies P(<S'j = 1), j = 1,. . . ,A. Jackson does not say how to 
find these probabilities, although he does say that usually the null hypothesis is 
that seismicity follows a Poisson process with rates equal to the historical rates. He 
assumes that {Sj}^^^ are independent, so S is the sum of A independent Bernoulli 
random variables. Jackson advocates estimating the P- value, P(S' > 5'obgerved)' 
simulating the distribution of the sum of independent Bernoulli variables, and men- 
tions the Poisson approximation as an alternative. See Kagan and Jackson [22] for 
more discussion of the same approaches. Both articles advocate a likelihood-ratio 
test for evaluating probabilistic forecasts. (See [IG] for an application.) They also 
propose a variant of the Neyman-Pcarson testing paradigm in which it is possible 
that both the null hypothesis and the alternative hypothesis are rejected, in effect 
combining a goodness-of-fit test of the null with a likelihood ratio test against the 
alternative. 

3.3. Console, 2001 

Console [s] addresses deterministic predictions and probabilistic forecasts. His dis- 
cussion of deterministic predictions includes several statistics for comparing al- 
ternative sets of predictions. His discussion of probabilistic forecasts is based on 
the likelihood approach in Kagan and Jackson [22], described above. The likelihood 
function assumes that predictions succeed independently, with known probabilities. 
For Console, the null hypothesis is that seismicity has a Poisson distribution ([8], 
page 266). He gives one numerical example of testing a set of four predictions on 
the basis of "probability gain," but no hint as to how to determine the significance 
level or power of such tests. His test rejects the null hypothesis if more events occur 
during alarms than are expected on the assumption that seismicity has a homoge- 
neous Poisson distribution with true rate equal to the observed rate. Console also 
mentions selecting prediction methods on the basis of a risk function, and Bayesian 
methods. The loss function Console contemplates is linear in the number of pre- 
dicted events, the number of unpredicted events, and the total length of alarms, all 

^'^This kind of declustering produces a process that has less clustering than a Poisson process 
because it imposes a minimum distance between events. 
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of which are treated as random. He does not address estimating the risk from data, 
but it seems that any estimate must involve stochastic assumptions about Q, S, F 
and M. 



3.4. Shi, Liu and Zhang, 2001 

Shi, Liu and Zhang [40] evaluate official Chinese earthquake predictions of earth- 
quakes with magnitude 5 and above for 1990-1998. They divide the study region 
into 3,743 small cells in space, and years of time. In a given cell in a given year, 
cither an earthquake is predicted to occur, or — if not — that is considered to be a 
prediction that there will be no event in that cell during that year. They define the 
i?-score as 

^ jj= cells in which earthquakes are successfully predicted 

# cells in which earthquakes occur 
# cells with false alarms 
# aseismic cells ' 

which measures the concordance of the binned data with predictions of occurrence 
and of non-occurrence. In computing the i?-score, they first dccluster the catalog 
using the method of Keilis-Borok et al. [2.]]. Their hypothesis tests use the R- 
score as the test statistic. They compare the i?-score of the actual predictions on 
the declustered catalog with the i?-score of several sets of random predictions, 
generated as follows: 

1. Condition on the number of cells in which earthquakes are predicted to oc- 
cur. Choose that many cells at random without replacement from the 3,743 
cells, with the same chance of selecting each cell; predict that earthquakes of 
magnitude 5 or above will occur in those randomly-selected cells. 

2. To take spatial heterogeneity into account, for the jth cell, toss a pj-coin, 
where pj is proportional to the historical rate of seismicity in that cell. If the 
jth coin lands heads, predict that an earthquake of magnitude 5 or above will 
occur in the jth cell. Toss coins independently for all cells, j = 1, . . . , 3743. 
The constant of proportionality is the ratio of the number of cells for which the 
actual predictions anticipates events, divided by the historical annual average 
number of cells in which events occur. This produces a random number of 
predictions, with predictions more likely in cells where more events occurred 
in the past. 

3. Condition on the number of cells in which earthquakes are predicted to oc- 
cur. Choose that many cells at random without replacement from the 3,743 
cells. Instead of selecting cells with equal probability, select the jth cell with 
probability pj, with pj set as in (2). Predict that earthquakes of magnitude 5 
or above will occur in those randomly-selected cells. 

The third approach is a blend of the first two approaches: The number of simulated 
predictions each year is forced to equal the actual number of predictions, but the 
chance of raising a prediction in the jth cell depends on the historical rate of 
seismicity in the jth cell. None of these three comparison methods depends on the 
observed seismicity during the study period, 1990-1998. In particular, none exploits 
clustering, which is presumed to have been eliminated from the catalog. 
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4. Some claims of successful predictions 
4.1. Wyss and Burford, 1987 

Wyss and Burford [5")] claim to have predicted the magnitude Ml ~ 4.6 earthquake 
that occurred on 31 May 1986 near Stone Canyon, California, about a year before 
it occurred, using "seismic quiescence," an anomalous paucity of earthquakes over 
some period of time. They examine the rates of earthquakes on different sections 
of the San Andreas fault and identify two fault sections in which the rate dropped 
compared with the rates in neighboring sections. They say that "the probability [of 
the prediction] to have come true by chance is < 5%." The probability they calculate 
is the chance that an earthquake would occur in the alarm region, if earthquakes 
occurred at random, independently, uniformly in space and time, with rate equal 
to the historic rate in the study area over the previous decade. In effect, their null 
hypothesis is that seismicity follows a homogeneous Poisson process with rate equal 
to the historical rate; clustering is not taken into account. 

4-2. VAN predictions based on Seismic Electrical Signals 

There has been a lively debate in the literature about whether predictions made by 
Varotsos, Alexopoulos and Nomicos (VAN) [50] of earthquakes in Greece succeeded 
beyond chance. See volume 23 of Geophysical Research Letters (1996). The par- 
ticipants did not even agree about the number of earthquakes that were predicted 
successfully, much less whether the number of successes was surprising. Participants 
disagreed about whether the predictions were too vague to be considered predic- 
tions, whether some aspects of the predictions were adjusted post hoc, what the 
null hypothesis should be, and what tests were appropriate. 

4.3. Kossobokov et al, 1999 

Kossobokov, Romashkova, Keilis-Borok and Healy [28] claim to have predicted four 
of the five magnitude 8 and larger earthquakes that occurred in the circum-Pacific 
region between 1992 and 1997. They say "[t]he statistical significance of the achieved 
results is beyond 99%." (From context, it is clear that they mean that the P-value 
is < 1%.) Their predictions are based on two algorithms, M8 and MSc, which 
track the running mean of the number of main shocks; the difference between the 
cumulative number of main shocks and a windowed trend in the number of main 
shocks; a measure of spatial clustering of main shocks derived from the distance 
between shocks and the diameters of the sources in a temporal window; and the 
largest number of aftershocks of any event in a temporal window. These are used to 
identify "times of increased probability," which are predictions that last five years. 
The declustering method described above was used to classify events as main shocks 
or aftershocks. 

Kossobokov et al. [28] calculate statistical significance by assuming that earth- 
quakes follow a Poisson process that is homogeneous in time but heterogeneous in 
space, with an intensity estimated from the historical rates of seismicity, {Xj}. Kos- 
sobokov et al. [2s] condition on the number of events that occur in the study area, 
which leads to a calculation in which locations and times are iid across events, 
the epicenters and times are independent of each other, the temporal density of 
earthquake times is uniform, and the spatial distribution of epicenters is given by 
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the historical distribution between 1992 and 1997. Their calculation does not take 
temporal clustering into account, and it conditions on the predictions. They calcu- 
late the chance that S or more of the Q events would occur during alarms to be 
Y^x=s Q^x'^^i^ — tt)"^^^, where tt is the normalized measure of the union of the 
alarms. The measure is the product of the uniform measure on time and counting 
measure on space, using the historical distribution of epicenters in the study volume 
to define the counting measure. 

5. A naive predictor 

In this section we exploit the empirical clustering of earthquakes to construct a pre- 
dictor that succeeds far beyond chance according to some tests that hold predictions 
fixed and treat seismicity as random. The chance model for seismicity uses the ob- 
served times and locations of earthquakes, but shuffles the times. That is, according 
to the null hypothesis, the times of events are exchangeable given their locations and 
magnitudes and the predictions. We can simulate from the null model by randomly 
permuting the list of observed times relative to the list of observed locations and 
magnitudes. This is a common "nonparametric" approach to testing earthquake 
predictions [20, 37]. That is, if the location, magnitude and time of the events in 
the catalog are ^^j, we take the Ql outcomes {(fj, Afj, i7r(j))}^i (as 

TT ranges over all Ql permutations of {1, . . . , Q}) to be equally likely under the null 
hypothesis, given the predictions. We do not claim that this predictor or this null 
hypothesis is good. Rather, we claim that this approach to testing is misleading. 

We apply the approach to the Harvard centroid moment tensor (CMT) catalog^^ 
for 2004 and for 2000-2004. We make two sets of predictions: 

(i) After each earthquake of (body-wave) magnitude Air or greater, predict that 
there will be an earthquake of magnitude Mt or greater within 21 days, and 
within 50 km epicentral distance. 

(ii) After each earthquake of magnitude Mr or greater, predict that there will be 
an earthquake within 21 days and within 50 km that is at least as large as 
any within 50 km within 21 days prior to its occurrence. 

Predictor (ii) is equivalent to predictor (i) if an event is deemed eligible for predic- 
tion only if there is no larger event within 50 km in the 21 days leading up to the 
event. 

Let Mj be the magnitude of the jth event that triggers an alarm; let tj be 
the time of the jth event that triggers an alarm; and let Rj be the set of points 
on Earth's surface that are within 50 km of the epicenter of the jth event that 
triggers an alarm. Recall that an alarm is a connected region Vj of space, time, and 
magnitude. For predictor (i), 

(6) Vj = Rj X {tj.tj + 21 days] x [Mr, oo), 

while for predictor (ii), 
(7) 

Vj = {Rjx{t^,t^+21 days]x[Afj,oo)}\ |J {Rkx{tk,tk + 21 days] x [A/^, M^)}. 

k:Mk>Mj 



http : //www. global cmt . org/CMTsearch .html 
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An event at time t, epicenter r and with magnitude M is predicted by the second 
set of predictions if and only if 

(8) M e P|{[Afj, oo) : {t, r) £ {tj,tj + 21 days] x R^}. 

3 

Predictor (i) tends to predict more aftershocks: Large events trigger alarms that 
contain some of their aftershocks. Predictor (ii) prevents aftershocks from being 
predicted by main shocks; however, it does not prevent aftershocks with magnitude 
Mt or larger from predicting still larger aftershocks of the same main event, pro- 
vided the predicted aftershock is the largest event in the preceding 21 days, within 
50 km. Predictors (i) and (ii) generate the same number of alarms and have the 
same total duration, but not the same extent of space and magnitude. Note that 
these alarms need not be disjoint. 

We consider two values of the threshold magnitude Mt-: 5.5 and 5.8. We compare 
the number of events successfully predicted by these two predictors with the distri- 
bution of number that would be predicted successfully if seismicity were "random." 
Using the CMT catalog, we generate a set of alarms. Holding those alarms fixed, we 
see how successful the alarms would be in predicting random seismicity — generated 
by randomly permuting the times in the CMT catalog. 

Table 1 summarizes the results. Under the null hypothesis, both prediction meth- 
ods (i and ii) succeed well beyond chance for CMT data from the year 2004 and the 
years 2000-2004, for both values of the threshold magnitude. That is not because 
the prediction method is good; rather, it is because the stochastic model in the null 
hypothesis fails to take into account the empirical clustering of earthquakes and 

Table 1 

Simulation results using the global Harvard Centroid Moment Tensor ( CMT) catalog. We seek to 
predict events with body-wave magnitude M-r and above. "Events" is the total number of events in 
the time period with magnitude at least . Each event with body-wave magnitude or greater 
triggers an alarm. In each row, the number of alarms is equal to the number of events in column 3. 
The spatial extent of the alarm is a spherical cap of radius 50 km centered at the epicenter of 
the event that triggers the alarm. The temporal extent of the alarm is 21 days, starting at the 
time of the event that triggers the alarm. We set the magnitude extent of alarms in two ways. 
Column 4, 'succ,' is the number of successful predictions using predictor (i): It is the number 
of events with magnitude at least Mt that are within 21 days following and within 50 km of 
the epicenter of an event with magnitude Mr or greater. Column 5, 'succ w/o, ' is the number 
of successful predictions using predictor (ii): It is the number of events that are within 21 days 
following and within 50 km of the epicenter of an event whose magnitude is at least M^- but no 
greater than that of the event in question. Events that follow within 21 days of a larger event 
are not counted. Column 6, 'max sim,' is the largest number of successful predictions in 1,000 
random permutations of the times of the events Harvard CMT catalog, holding the alarms and 
the locations and magnitudes of events in the catalog fixed. The alarms are those corresponding 
to column 5 — predictor (ii) in the text — that is, an event is eligible for prediction only if its 
magnitude exceeds that of every event within 50 km within the 21 days preceding it. Column 1, 
'P-value (est),' is the estimated P-value: the fraction of permutations in which the number of 
successful predictions was greater than or equal to the observed number of successful predictions 
for the CMT catalog. This corresponds to predictor (ii), which is intended to reduce the number 
of predictions satisfied by aftershocks. Column 8, 'v, ' is an upper bound on the fraction of the 
study region (in space and time) covered by alarms; it is not adjusted for overlap of alarms. 



year 




Mr 


events 


succ 


succ 
w/o 


max 
sim 


P-value 
(est) 


V 


2004 




5.5 


445 


95 


30 


28 


<0.001 


3.9 X 10"* 


2004 




5.8 


207 


24 


7 


10 


0.041 


1.8 X 10"* 


2000- 


-2004 


5.5 


2013 


320 


85 


48 


<0.001 


3.6 X 10"* 


2000- 


-2004 


5.8 


996 


114 


29 


19 


<0.001 


1.8 X 10-4 
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the dependence of the predictions on the seismicity. Holding predictions fixed as 
seismicity varies randomly does not make sense. 

6. Conclusions 

Interpreting earthquake predictions is difficult. So is evaluating whether predic- 
tions work. To use a statistical hypothesis test, something must be assumed to be 
random, and its probability distribution under the null hypothesis must be known. 
Many studies model seismicity as random under the null hypothesis. That approach 
has serious drawbacks, and details of the stochastic model, such as spatial hetero- 
geneity, independence or exchangeability, matter for testing. Most null hypotheses 
used in tests ignore the empirical clustering of earthquakes. Some try to remove 
clustering with ad hoc adjustments as a prelude to probability calculations. It is 
implausible that the resulting data represent a realization of a Poisson process, as is 
often assumed. The standard approach to testing — hold the predictions fixed while 
seismicity varies randomly according to some stochastic model — does not take into 
account that in practice, the predictions would be different if the seismicity were 
different. The result is that simple-minded schemes, such as the "automatic alarm 
strategy," can succeed well beyond chance in hypothesis tests. This is not because 
the predictions arc good: It is because the tests are bad. 

Acknowledgments. We arc grateful to David A. Frecdman, Cliff Frohlich, 
Robert Geller and Yan Y. Kagan for helpful conversations. 
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